System and method for supporting controlled re-routing in an InfiniBand (IB) network

ABSTRACT

A system and method can support controlled re-routing in an InfiniBand (IB) fabric. The fabric is associated with a subnet manager that can detect a connectivity change in the fabric, and re-route the fabric accordingly. The subnet manager can ensure that only accredited components and connectivity are utilized in the re-routing, and represent the connectivity that is not accredited within a local subnet or sub-subnet. The subnet manager can further maintain a node record or fabric configuration for evaluating the detected connectivity change in the fabric.

CLAIM OF PRIORITY

This application claims the benefit of priority on U.S. Provisional Patent Application No. 61/493,330, entitled “STATEFUL SUBNET MANAGER FAILOVER IN A MIDDLEWARE MACHINE ENVIRONMENT” filed Jun. 3, 2011, which application is herein incorporated by reference.

COPYRIGHT NOTICE

A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.

FIELD OF INVENTION

The present invention is generally related to computer systems, and is particularly related to supporting an InfiniBand (IB) network.

BACKGROUND

The interconnection network plays a beneficial role in the next generation of supercomputers, clusters, and data centers. High performance network technology, such as the InfiniBand (IB) technology, is replacing proprietary or low-performance solutions in the high performance computing domain, where high bandwidth and low latency are the key requirements. For example, IB installations are used in supercomputers such as Los Alamos National Laboratory's Roadrunner, Texas Advanced Computing Center's Ranger, and Forschungszentrum Juelich's JuRoPa.

IB was first standardized in October 2000 as a merge of two older technologies called Future I/O and Next Generation I/O. Due to its low latency, high bandwidth, and efficient utilization of host-side processing resources, it has been gaining acceptance within the High Performance Computing (HPC) community as a solution to build large and scalable computer clusters. The de facto system software for IB is OpenFabrics Enterprise Distribution (OFED), which is developed by dedicated professionals and maintained by the OpenFabrics Alliance. OFED is open source and is available for both GNU/Linux and Microsoft Windows.

SUMMARY

Described herein is a system and method that can support controlled re-routing in an InfiniBand (IB) fabric. The fabric is associated with a subnet manager that can detect a connectivity change in the fabric, and re-route the fabric accordingly. The subnet manager can ensure that only accredited components and connectivity are utilized in the re-routing, and represent the connectivity that is not accredited within a local subnet or sub-subnet. The subnet manager can further maintain a node record or fabric configuration for evaluating the detected connectivity change in the fabric.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 shows an illustration of a fabric model in a middleware environment in accordance with an embodiment of the invention.

FIG. 2 shows an illustration of reconfiguring an InfiniBand (IB) fabric in accordance with an embodiment of the invention.

FIG. 3 shows an illustration of supporting a run-time mode for a subnet manager (SM) in an IB fabric in accordance with an embodiment of the invention.

FIG. 4 shows an illustration of supporting a SM that can maintain a record for known nodes in an IB fabric in accordance with an embodiment of the invention.

FIG. 5 shows an illustration of supporting a SM that can track an existing operational configuration in an IB fabric in accordance with an embodiment of the invention.

FIG. 6 shows an illustration of a SM that can maintain a blueprint fabric configuration in an IB fabric in accordance with an embodiment of the invention.

FIG. 7 illustrates an exemplary flow chart for supporting controlled re-routing in an IB fabric in accordance with an embodiment of the invention.

DETAILED DESCRIPTION

Described herein is a system and method that can support controlled re-routing in an interconnected network, such as an InfiniBand (IB) network.

FIG. 1 shows an illustration of a fabric model in a middleware environment in accordance with an embodiment of the invention. As shown in FIG. 1, an interconnected network, or a fabric 100, can include switches 101-103, bridges and routers 104, host channel adapters (HCAs) 105-106 and designated management hosts 107. Additionally, the fabric can include, or be connected to, one or more hosts 108 that are not designated management hosts.

The designated management hosts 107 can be installed with HCAs 105, 106, a network software stack and relevant management software in order to perform network management tasks. Furthermore, firmware and management software can be deployed on the switches 101-103, and the bridges and routers 104 to direct traffic flow in the fabric. Here, the host HCA drivers, OS and Hypervisors on hosts 108 that are not designated management hosts may be considered outside the scope of the fabric from a management perspective.

The fabric 100 can be in a single media type, e.g. an IB only fabric, and be fully connected. The physical connectivity in the fabric ensures in-band connectivity between any fabric components in the non-degraded scenarios. Alternatively, the fabric can be configured to include Ethernet (Enet) connectivity outside gateway (GW) external ports on a gateway 109. Additionally, it is also possible to have independent fabrics operating in parallel as part of a larger system. For example, the different fabrics can only be indirectly connected via different HCAs or HCA ports.

InfiniBand (IB) Architecture

IB architecture is a serial point-to-point technology. Each of the IB networks, or subnets, can include a set of hosts interconnected using switches and point-to-point links. A single subnet can be scalable to more than ten thousand nodes, and two or more subnets can be interconnected using an IB router. The hosts and switches within a subnet are addressed using local identifiers (LIDs), e.g. a single subnet may be limited to 49151 unicast addresses.
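The 49151 figure follows from how the 16-bit LID space is divided. The following is a small arithmetic sketch, assuming the unicast range 0x0001 to 0xBFFF from the IB specification; the constant and function names are illustrative and do not come from this disclosure.

```python
# Assumption: unicast LIDs occupy 0x0001..0xBFFF of the 16-bit LID space.
# 0x0000 is reserved, and values from 0xC000 upward are not unicast.
UNICAST_LID_MIN = 0x0001
UNICAST_LID_MAX = 0xBFFF


def unicast_lid_count() -> int:
    """Number of distinct unicast LIDs available in a single subnet."""
    return UNICAST_LID_MAX - UNICAST_LID_MIN + 1


def is_unicast_lid(lid: int) -> bool:
    """True if the LID falls inside the unicast range."""
    return UNICAST_LID_MIN <= lid <= UNICAST_LID_MAX


print(unicast_lid_count())  # 49151
```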

An IB subnet can employ at least one subnet manager (SM) which is responsible for initializing and starting up the subnet, including the configuration of all the IB ports residing on switches, routers and host channel adapters (HCAs) in the subnet. The SM's responsibility also includes routing table calculation and deployment. Routing of the network aims at obtaining full connectivity, deadlock freedom, and load balancing between all source and destination pairs. Routing tables can be calculated at network initialization time and this process can be repeated whenever the topology changes in order to update the routing tables and ensure optimal performance.

At the time of initialization, the SM starts in the discovering phase where the SM does a sweep of the network in order to discover all switches and hosts. During the discovering phase, the SM may also discover any other SMs present and negotiate who should be the master SM. When the discovering phase is completed, the SM can enter a master phase. In the master phase, the SM proceeds with LID assignment, switch configuration, routing table calculations and deployment, and port configuration. At this point, the subnet is up and ready to use.

After the subnet is configured, the SM can monitor the network for changes (e.g. a link goes down, a device is added, or a link is removed). If a change is detected during the monitoring process, a message (e.g. a trap) can be forwarded to the SM and the SM can reconfigure the network. Part of the reconfiguration process, or a heavy sweep process, is the rerouting of the network which can be performed in order to guarantee full connectivity, deadlock freedom, and proper load balancing between all source and destination pairs.

The HCAs in an IB network can communicate with each other using queue pairs (QPs). A QP is created during the communication setup, and a set of initial attributes such as QP number, HCA port, destination LID, queue sizes, and transport service are supplied. On the other hand, the QP associated with the HCAs in a communication is destroyed when the communication is over. An HCA can handle many QPs. Each QP consists of a pair of queues: a send queue (SQ) and a receive queue (RQ). There is one such pair present at each end-node that is participating in the communication. The send queue holds work requests to be transferred to the remote node, while the receive queue holds information on what to do with the data received from the remote node. In addition to the QPs, each HCA can have one or more completion queues (CQs) that are associated with a set of send and receive queues. The CQ holds completion notifications for the work requests posted to the send and receive queues.
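The relationship between the send queue, the receive queue, and the completion queue can be illustrated with a toy model. This is only a sketch of the data structures described above, not the verbs API of any real IB software stack; all class, field, and function names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class CompletionQueue:
    # Completion notifications for work requests posted to associated queues.
    entries: deque = field(default_factory=deque)


@dataclass
class QueuePair:
    qp_number: int
    dest_lid: int
    cq: CompletionQueue
    send_queue: deque = field(default_factory=deque)     # data to transfer
    receive_queue: deque = field(default_factory=deque)  # where to put received data

    def post_send(self, payload: bytes) -> None:
        self.send_queue.append(payload)

    def post_receive(self, buffer_name: str) -> None:
        self.receive_queue.append(buffer_name)


def transfer(sender: QueuePair, receiver: QueuePair) -> None:
    """Move one message from the sender's SQ to the buffer named in the receiver's RQ."""
    payload = sender.send_queue.popleft()
    buffer_name = receiver.receive_queue.popleft()
    sender.cq.entries.append(("send", sender.qp_number))
    receiver.cq.entries.append(("recv", receiver.qp_number, buffer_name, payload))
```

In this model each end-node owns one QP of the pair, and a completion entry appears on each side once the work request has been consumed.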

The IB architecture is a flexible architecture. Configuring and maintaining an IB subnet can be carried out via special in-band subnet management packets (SMPs). The functionalities of a SM can, in principle, be implemented from any node in the IB subnet. Each end-port in the IB subnet can have an associated subnet management agent (SMA) that is responsible for handling SMP based request packets that are directed to it. In the IB architecture, the same port can represent a SM instance or other software component that uses SMP based communication. Thus, only a well defined subset of SMP operations can be handled by the SMA.

SMPs use dedicated packet buffer resources in the fabric, e.g. a special virtual lane (VL15) that is not flow-controlled (i.e. SMP packets may be dropped in the case of buffer overflow). Also, SMPs can use either the routing that the SM sets up based on end-port Local Identifiers (LIDs), or SMPs can use direct routes where the route is fully defined by the sender and embedded in the packet. With direct routes, the packet's path through the fabric is expressed as an ordered sequence of port numbers on HCAs and switches.
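A direct route can be pictured as a list of outgoing port numbers, one per hop. The sketch below walks such a list over a small, made-up topology; the node names, port numbers, and the `follow_direct_route` helper are assumptions for illustration only.

```python
# Hypothetical topology: node name -> {outgoing port number: neighbor node name}
TOPOLOGY = {
    "hca_a": {1: "sw1"},
    "sw1": {1: "hca_a", 2: "sw2"},
    "sw2": {1: "sw1", 3: "hca_b"},
    "hca_b": {1: "sw2"},
}


def follow_direct_route(topology, start, ports):
    """Return the nodes visited when each hop leaves through the listed port."""
    path = [start]
    node = start
    for port in ports:
        node = topology[node][port]  # raises KeyError if the port is not cabled
        path.append(node)
    return path


print(follow_direct_route(TOPOLOGY, "hca_a", [1, 2, 3]))
# ['hca_a', 'sw1', 'sw2', 'hca_b']
```

Because the sender supplies every port number, this kind of route works before any LIDs or forwarding tables have been set up.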

The SM can monitor the network for changes using SMAs that are present in every switch and/or every HCA. The SMAs communicate changes, such as new connections, disconnections, and port state changes, to the SM using traps and notices. A trap is a message sent to alert end-nodes about a certain event. A trap can contain a notice attribute with the details describing the event. Different traps can be defined for different events. In order to reduce the unnecessary distribution of traps, IB applies an event forwarding mechanism where end-nodes are required to explicitly subscribe to the traps they want to be informed about.
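The subscription-based event forwarding described above can be sketched as a simple publish/subscribe table. The `EventForwarder` class and the trap names used here are hypothetical; real trap numbers and notice layouts are defined by the IB specification and are not modeled.

```python
from collections import defaultdict


class EventForwarder:
    """Forwards a notice only to end-nodes that subscribed to that trap."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # trap name -> list of callbacks

    def subscribe(self, trap_name, callback):
        self._subscribers[trap_name].append(callback)

    def forward(self, trap_name, notice):
        """Deliver the notice to subscribers and return how many received it."""
        callbacks = self._subscribers.get(trap_name, [])
        for callback in callbacks:
            callback(notice)
        return len(callbacks)
```

An end-node that never subscribed to a given trap receives nothing, which is the point of the mechanism: traps are not broadcast to the whole subnet.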

The subnet administrator (SA) is a subnet database associated with the master SM to store different information about a subnet. The communication with the SA can help the end-node to establish a QP by sending a general service management datagram (MAD) through a designated QP, e.g. QP1. Both sender and receiver require information such as source/destination LIDs, service level (SL), maximum transmission unit (MTU), etc. to establish communication via a QP. This information can be retrieved from a data structure known as a path record that is provided by the SA. In order to obtain a path record, the end-node can perform a path record query to the SA, e.g. using the SubnAdmGet/SubnAdmGetTable operation. Then, the SA can return the requested path records to the end-node.

The IB architecture provides partitions as a way to define which IB end-ports should be allowed to communicate with other IB end-ports. Partitioning is defined for all non-SMP packets on the IB fabric. The use of partitions other than the default partition is optional. The partition of a packet can be defined by a 16 bit P_Key that consists of a 15 bit partition number and a single bit member type (full or limited).
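The P_Key layout can be made concrete with a short decoding sketch. It assumes, as in the IB specification, that the high-order bit carries the membership type (set for full, clear for limited) and that two limited members of the same partition may not communicate with each other; the function names are illustrative.

```python
FULL_MEMBER_BIT = 0x8000  # assumption: top bit set means full membership
PARTITION_MASK = 0x7FFF   # low 15 bits hold the partition number


def decode_pkey(pkey: int):
    """Split a 16-bit P_Key into (partition number, is_full_member)."""
    if not 0 <= pkey <= 0xFFFF:
        raise ValueError("P_Key must fit in 16 bits")
    return pkey & PARTITION_MASK, bool(pkey & FULL_MEMBER_BIT)


def may_communicate(pkey_a: int, pkey_b: int) -> bool:
    """Same partition number, and at least one side is a full member."""
    number_a, full_a = decode_pkey(pkey_a)
    number_b, full_b = decode_pkey(pkey_b)
    return number_a == number_b and (full_a or full_b)


print(decode_pkey(0xFFFF))  # (32767, True)
```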

The partition membership of a host port, or an HCA port, can be based on the premise that the SM sets up the P_Key table of the port with P_Key values that correspond to the current partition membership policy for that host. In order to compensate for the possibility that the host may not be fully trusted, the IB architecture also defines that switch ports can optionally be set up to do partition enforcement. Hence, the P_Key tables of switch ports that connect to host ports can then be set up to reflect the same partitions as the host port is supposed to be a member of (i.e. in essence equivalent to switch enforced VLAN control in Ethernet LANs).

Since the IB architecture allows full in-band configuration and maintenance of an IB subnet via SMPs, the SMPs themselves are not subject to any partition membership restrictions. Thus, in order to avoid the possibility that any rogue or compromised node on the IB fabric is able to define an arbitrary fabric configuration (including partition membership), other protection mechanisms are needed.

M_Keys can be used as the basic protection/security mechanism in the IB architecture for SMP access. An M_Key is a 64 bit value that can be associated individually with each individual node in the IB subnet, and where incoming SMP operations may be accepted or rejected by the target node depending on whether the SMP includes the correct M_Key value (i.e. unlike P_Keys, the ability to specify the correct M_Key value—like a password—represents the access control).

By using an out-of-band method for defining M_Keys associated with switches, it is possible to ensure that no host node is able to set up any switch configuration, including partition membership for the local switch port. Thus, an M_Key value is defined when the switch IB links become operational. Hence, as long as the M_Key value is not compromised or “guessed” and the switch out-of-band access is secure and restricted to authorized fabric administrators, the fabric is secure.

Furthermore, the M_Key enforcement policy can be set up to allow read-only SMP access for all local state information except the current M_Key value. Thus, it is possible to protect the switch based fabric from un-authorized (re-)configuration, and still allow host based tools to perform discovery and diagnostic operations.
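The M_Key behavior described in the preceding paragraphs can be sketched as a small access check. This is a deliberately simplified model and does not reproduce the protection levels of the IB specification; the `ManagedNode` class, its flag, and the method and attribute strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ManagedNode:
    m_key: int
    # Assumed policy flag: allow reads of local state without the correct M_Key.
    allow_unauthenticated_reads: bool = True

    def handle_smp(self, method: str, attribute: str, presented_m_key: int) -> str:
        """Accept or reject an incoming SMP based on the presented M_Key."""
        if presented_m_key == self.m_key:
            return "accept"
        # Without the correct key, only reads are allowed, and never of the M_Key itself.
        if method == "Get" and self.allow_unauthenticated_reads and attribute != "M_Key":
            return "accept"
        return "reject"
```

With this policy a host based diagnostic tool can still read port state, while any attempt to change switch configuration without the key is rejected.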

The flexibility provided by the IB architecture allows the administrators of IB fabrics/subnets, e.g. HPC clusters, to decide whether to use embedded SM instances on one or more switches in the fabric and/or set up one or more hosts on the IB fabric to perform the SM function. Also, since the wire protocol defined by the SMPs used by the SMs is available through APIs, different tools and commands can be implemented based on such SMPs for discovery and diagnostics, and can be controlled independently of any current Subnet Manager operation.

From a security perspective, the flexibility of IB architecture indicates that there is no fundamental difference between root access to the various hosts connected to the IB fabric and the root access allowing access to the IB fabric configuration. This is fine for systems that are physically secure and stable. However, this can be problematic for system configurations where different hosts on the IB fabric are controlled by different system administrators, and where such hosts should be logically isolated from each other on the IB fabric.

Controlled Re-Routing

In accordance with an embodiment of the invention, route connectivity can be dynamically maintained within a fabric.

FIG. 2 shows an illustration of reconfiguring an IB fabric in accordance with an embodiment of the invention. As shown in FIG. 2, a SM 201 in an IB fabric 200 can be used to re-route the fabric when a connectivity change 202 is detected. The connectivity change 202 can happen when the IB fabric 200 is physically re-configured, e.g. when a cable is added or removed, or when a disaster strikes the fabric.

The SM 201 may try to automatically re-route the fabric after a change in connectivity is detected. This approach may be problematic because the fabric connectivity during intermediate stages of cabling may not represent a legal topology that can be routed, or because the fabric connectivity may not be routable without negative impact on other parts of the operational fabric. The automatic re-routing approach cannot guarantee the fabric state to be atomic since the SM may be able to detect multiple intermediate changes. The SM can detect the multiple intermediate changes despite attempts to make sure that all relevant components are powered down or disabled while the actual re-cabling or other component insertion/removal is taking place. For example, the SM may observe the multiple intermediate changes (1) when the components that are to be added to the fabric are not powered on until all relevant (cable) connectivity has been established, and/or (2) when components that are to be removed are shut down before any cables are removed.

FIG. 3 shows an illustration of supporting a run-time mode for a SM in an IB fabric in accordance with an embodiment of the invention. As shown in FIG. 3, a SM 301 in an IB fabric 300 can include a run-time mode 310. In such a run-time mode 310, the SM 301 can be prevented from performing re-routing operations when it detects that the connectivity in the fabric changes 302.

The benefit of this approach is that the changed fabric connectivity can be verified, or accredited, before any re-routing is actually performed by the SM 301. On the other hand, when an operational connectivity 303 is temporarily modified, the lack of re-routing may prevent this connectivity from being restored until the complete re-configuration has taken place, and additional mechanisms can be used to avoid this. For example, the temporary connectivity change can be caused (1) by an operator error, or (2) by a need to change an existing cable during the overall re-construction, or (3) by a trivial change such as moving one end of a cable from one connector to another, where both connectors are associated with the same switch chip.
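The run-time mode can be sketched as a flag that turns detected changes into pending items instead of immediate re-routing. The class below is a minimal illustration under that assumption; the method names and the counter standing in for an actual routing computation are hypothetical.

```python
class SubnetManager:
    """Toy SM that defers re-routing while in run-time mode."""

    def __init__(self):
        self.run_time_mode = False
        self.pending_changes = []
        self.reroute_count = 0  # stands in for real routing work

    def on_connectivity_change(self, change) -> bool:
        """Return True if the change triggered an immediate re-route."""
        if self.run_time_mode:
            self.pending_changes.append(change)  # held until accredited
            return False
        self._reroute()
        return True

    def accredit_and_reroute(self):
        """Called once the changed connectivity has been verified."""
        accepted = list(self.pending_changes)
        self.pending_changes.clear()
        self._reroute()
        return accepted

    def _reroute(self) -> None:
        self.reroute_count += 1
```

Several intermediate cabling steps therefore cost a single re-route once the operator confirms the final state.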

FIG. 4 shows an illustration of supporting a SM that can maintain a record for known nodes in an IB fabric in accordance with an embodiment of the invention. As shown in FIG. 4, the SM 401 in the fabric 400 can keep a node record 410, which can be a list that includes information about each known node in the fabric. Such node information may include different hardware global unique identifiers (GUIDs) in the fabric and the known connectivity 403 that exists within the fabric, e.g. links connecting ports on various nodes in the fabric.

The SM 401 can operate in a normal mode, where the SM 401 can update the node record 410 and perform re-routing dynamically.

Additionally, the SM 401 can operate in a controlled re-routing mode, in which mode the SM can evaluate the connectivity change events 402 in the fabric and then determine if each connectivity change involves an existing connectivity or a new connectivity. If a connectivity change represents a new connectivity, then this new connectivity may be ignored until an explicit instruction to perform a new complete re-routing is received, and then the record of the known nodes and connectivity is also updated. On the other hand, if this connectivity change represents a lost connectivity (e.g. a switch node or a switch-switch link has disappeared), then the re-routing may try to compensate for the lost connectivity. Additionally, there can be a special case where a changed connectivity between two switch nodes is equivalent to an old connectivity. In this case, the changed, but equivalent, connectivity may not be considered as a new connectivity.
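The evaluation of connectivity change events against the node record can be sketched as follows. The model is an assumption made for illustration: a link is a set of two (node GUID, port) endpoints, and a changed link counts as equivalent when it joins the same pair of nodes as a known link that was lost. The class and method names are hypothetical.

```python
from dataclasses import dataclass, field


def link(end_a, end_b):
    """A link is an unordered pair of (node_guid, port) endpoints."""
    return frozenset((end_a, end_b))


def node_pair(lnk):
    """The set of node GUIDs a link connects, ignoring port numbers."""
    return frozenset(guid for guid, _port in lnk)


@dataclass
class ControlledSm:
    known_links: set
    lost: set = field(default_factory=set)
    pending: set = field(default_factory=set)  # new links awaiting explicit re-route

    def on_change(self, event: str, lnk) -> str:
        if event == "down":
            if lnk in self.known_links:
                self.lost.add(lnk)
                return "lost"  # re-routing may compensate for this
            self.pending.discard(lnk)
            return "ignored"
        if lnk in self.known_links:
            self.lost.discard(lnk)
            return "existing"
        for old in list(self.lost):
            if node_pair(old) == node_pair(lnk):
                # Same two nodes, different port: treat as the old connectivity.
                self.lost.remove(old)
                self.known_links.remove(old)
                self.known_links.add(lnk)
                return "equivalent"
        self.pending.add(lnk)
        return "new"  # ignored until complete_reroute() is requested

    def complete_reroute(self) -> None:
        """Explicit instruction: adopt pending links and update the record."""
        self.known_links = (self.known_links - self.lost) | self.pending
        self.lost.clear()
        self.pending.clear()
```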

The benefit of this approach is that an established operational fabric configuration can be maintained without any negative impact caused by partially implemented additional connectivity. Also, the SM 401 can be implemented as a conventional SM.

On the other hand, the replacement of a node with an equivalent node in the existing operational fabric may not take effect until a complete re-routing of all current connectivity is performed. Additional mechanisms can be used so that the replacement of a particular node during a major reconstruction of the fabric does not depend on the completion of other parts of the reconstruction, which can be time consuming and inconvenient.

FIG. 5 shows an illustration of supporting a SM that can track an existing operational configuration in an IB fabric in accordance with an embodiment of the invention. As shown in FIG. 5, a SM 501 in an IB fabric 500 can use a state logic that allows the SM 501 to keep track of a connectivity change 502, e.g. a change in the connectivity between specific node types at specific locations 503 in an IB fabric, without depending on tracking the unique hardware GUIDs.

In an IB fabric 500, the topology is recognized as a collection of node types, where the way these node types are interconnected defines the specific location of the node type in the topology. The location of a node in the topology is defined by the fabric connectivity, which may or may not correspond to a geographical location of a node. For example, the geographical location of a node can represent a card type plugged into a specific slot in a switch or server chassis. Depending on associated fixed and cabled links, the geographical location of the node may or may not also represent a specific location in the fabric topology.
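One way to picture location-based tracking is to describe a node by its type and by what its ports attach to, rather than by its GUID. The sketch below uses a one-hop signature for this purpose, which is an assumption made for brevity; a fuller implementation would likely need to consider more of the surrounding topology. The function name and data layout are hypothetical.

```python
def location_signature(guid, node_types, links):
    """Describe a node by its type and its port attachments, not by its GUID.

    node_types: {guid: type name}
    links: iterable of ((guid_a, port_a), (guid_b, port_b)) tuples
    """
    attachments = []
    for (guid_a, port_a), (guid_b, port_b) in links:
        if guid_a == guid:
            attachments.append((port_a, node_types[guid_b], port_b))
        elif guid_b == guid:
            attachments.append((port_b, node_types[guid_a], port_a))
    return (node_types[guid], tuple(sorted(attachments)))
```

Two nodes with different GUIDs but identical signatures occupy the same location in this model, so swapping one for the other would not register as new connectivity.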

The benefit of this approach is that the replacement of components in the operational part of the fabric can be achieved without depending on a complete re-route operation. On the other hand, if a step-wise expansion of a fabric configuration is requested, then the routing of an intermediate operational sub-configuration may be in conflict with the routing of the complete configuration or a larger sub-configuration. Thus, the connectivity in the original sub-configuration may only be maintained with an interruption during re-routing of a larger configuration.

FIG. 6 shows an illustration of a SM that can maintain a blueprint fabric configuration in an IB fabric in accordance with an embodiment of the invention. As shown in FIG. 6, a SM 601 in an IB fabric 600 can have state logic that allows the SM 601 to maintain a blueprint fabric configuration 610. The blueprint fabric configuration 610 can be used as a basis for evaluating a fabric configuration actually discovered, such as a new or a changed fabric configuration, when a connectivity change 602 is detected. If the discovered configuration and connectivity 603 are in agreement with the blueprint fabric configuration registered with the SM 601, then the discovered configuration and connectivity can be correlated to the blueprint fabric configuration, and thereby be used for routing as a subset of the complete blueprint configuration. Otherwise, if the discovered connectivity 604 is in conflict with the blueprint fabric configuration, then the discovered connectivity can be ignored from the routing perspective, and can be reported by the SM as being in conflict with the current blueprint.
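The comparison of discovered connectivity against a blueprint can be sketched with set operations. The sketch assumes that links are expressed in terms of topology locations (for example "spine1/p1") rather than GUIDs, and that a link is in agreement when it appears in the blueprint; the helper names and the location strings are hypothetical.

```python
def blueprint_link(end_a: str, end_b: str):
    """A link between two location/port endpoints, independent of direction."""
    return frozenset((end_a, end_b))


def evaluate_against_blueprint(blueprint_links, discovered_links):
    """Split discovered links into routable ones and ones to report as conflicts."""
    blueprint = set(blueprint_links)
    discovered = set(discovered_links)
    routable = discovered & blueprint     # used for routing as a subset of the blueprint
    conflicting = discovered - blueprint  # ignored for routing, reported by the SM
    return routable, conflicting
```

Blueprint links that have not been discovered yet are simply absent from both results, which matches the idea of routing a subset of the complete configuration.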

Configurations that represent well defined subsets of the blueprint configuration can optionally be routed more optimally than in the blueprint case. For example, the static load distribution in a subset of a larger fat-tree topology can use all the spine switches and links that are actually present rather than establishing a currently imbalanced scheme with the assumption that more switches and/or links will be added later.

In accordance with an embodiment of the invention, more than one alternative blueprint fabric configuration can be defined. The actual routing of various fabric configurations can be done up-front using topology simulations. Then, taking advantage of the results of the topology simulations, a fast discovery and distribution of switch forwarding table contents can take place at run-time, thereby reducing the total SM level re-configuration time.

This approach can optimize the handling of both static fabric configurations and dynamically expanding fabric configurations. In some cases, trying to match the actual physical configuration with one blueprint among a very large set of alternative blueprints may require significant processing overhead. Furthermore, the SM 601 can maintain an ordered list of multiple preferred blueprints that the actual physical configuration should be matched against.
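Matching against an ordered list of preferred blueprints can be sketched as a first-match search. The sketch assumes that a physical configuration matches a blueprint when every discovered link appears in it; the function name and the representation of a blueprint as a (name, links) pair are hypothetical.

```python
def first_matching_blueprint(ordered_blueprints, discovered_links):
    """Return the name of the first blueprint that contains every discovered link.

    ordered_blueprints: list of (name, iterable of links), most preferred first.
    """
    discovered = set(discovered_links)
    for name, links in ordered_blueprints:
        if discovered <= set(links):
            return name
    return None  # no registered blueprint matches the physical configuration
```

Keeping the list short and ordered bounds the matching cost, since the search stops at the first preferred blueprint that fits.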

FIG. 7 illustrates an exemplary flow chart for supporting controlled re-routing in an IB fabric in accordance with an embodiment of the invention. As shown in FIG. 7, at step 701, a SM can detect a connectivity change in an IB fabric. Then, at step 702, the SM can re-route the fabric and ensure that only already accredited components and/or connectivity are utilized in the re-routing. In the meantime, the components and connectivity that are not yet accredited are ignored for normal data traffic, e.g. the routing and path set-up logic may not take into account the connectivity that is not accredited. Finally, at step 703, the complete connectivity may still be explored, and represented within the local subnet or sub-subnet. Thus, maximum availability and routing stability can be achieved in a well configured part of an IB fabric while allowing dynamic re-configuration, e.g. the expanding, reducing, and changing of the physical connectivity, in other parts of the IB fabric.
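The steps of FIG. 7 can be sketched as a routing pass that only traverses accredited links while still reporting everything that was discovered. Breadth-first search is used here purely as a stand-in for a real routing engine, and the function name, node names, and return format are assumptions for illustration.

```python
from collections import deque


def controlled_reroute(all_links, accredited_links, source):
    """Compute hop counts from source over accredited links only.

    Returns (hops, unaccredited), where unaccredited lists discovered links
    that are represented but not used for normal data traffic.
    """
    def is_accredited(a, b):
        return (a, b) in accredited_links or (b, a) in accredited_links

    adjacency = {}
    unaccredited = []
    for a, b in all_links:
        if is_accredited(a, b):
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)
        else:
            unaccredited.append((a, b))

    hops = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops, unaccredited
```

A switch that is reachable only through an unaccredited link does not appear in the hop table, yet the link itself is still returned so that it can be represented and later accredited.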

The present invention may be conveniently implemented using one or more conventional general purpose or specialized digital computers, computing devices, machines, or microprocessors, including one or more processors, memory and/or computer readable storage media programmed according to the teachings of the present disclosure. Appropriate software coding can readily be prepared by skilled programmers based on the teachings of the present disclosure, as will be apparent to those skilled in the software art.

In some embodiments, the present invention includes a computer program product which is a storage medium or computer readable medium (media) having instructions stored thereon/in which can be used to program a computer to perform any of the processes of the present invention. The storage medium can include, but is not limited to, any type of disk including floppy disks, optical discs, DVD, CD-ROMs, microdrive, and magneto-optical disks, ROMs, RAMs, EPROMs, EEPROMs, DRAMs, VRAMs, flash memory devices, magnetic or optical cards, nanosystems (including molecular memory ICs), or any type of media or device suitable for storing instructions and/or data.

The foregoing description of the present invention has been provided for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations will be apparent to the practitioner skilled in the art. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, thereby enabling others skilled in the art to understand the invention for various embodiments and with various modifications that are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the following claims and their equivalents.

1. A method for supporting controlled re-routing in an InfiniBand (IB) network, comprising: detecting, via a subnet manager, a connectivity change in the fabric, re-routing the fabric, via the subnet manager, and ensuring that only accredited components and connectivity are utilized in the re-routing, and representing the connectivity that is not accredited within a local subnet or sub-subnet.
 2. The method according to claim 1, further comprising: allowing the fabric to be physically re-configured.
 3. The method according to claim 1, further comprising: ignoring connectivity that is not accredited in the re-routing.
 4. The method according to claim 1, further comprising: allowing the fabric to be automatically re-routed after a connectivity change is detected.
 5. The method according to claim 1, further comprising: preventing the subnet manager from performing re-routing operations after the subnet manager detects a connectivity change, when the subnet manager is in a run-time mode.
 6. The method according to claim 5, further comprising: verifying the connectivity change in the fabric before performing the re-routing.
 7. The method according to claim 1, further comprising: keeping a node record that includes information about each known node in the fabric.
 8. The method according to claim 7, further comprising: evaluating the connectivity change in the fabric and determining whether each connectivity change involves an existing connectivity or a new connectivity.
 9. The method according to claim 8, further comprising: if the connectivity change represents a new connectivity, then performing the steps of ignoring the connectivity change until an explicit instruction to perform a complete re-routing is received, and updating the node record.
 10. The method according to claim 8, further comprising: if the change represents a lost connectivity, then trying to compensate for the lost connectivity in the re-routing.
 11. The method according to claim 8, further comprising: if the change represents a changed but equivalent connectivity, then determining that the connectivity change does not involve a new connectivity.
 12. The method according to claim 1, further comprising: tracking only connectivity between specific node types, and/or other existing operational configuration at specific locations in the fabric.
 13. The method according to claim 1, further comprising: maintaining a blueprint fabric configuration via the subnet manager, wherein the blueprint fabric configuration can be used as a basis for evaluating a discovered fabric configuration.
 14. The method according to claim 13, further comprising: correlating the discovered fabric configuration to the blueprint fabric configuration and using the discovered fabric configuration for routing as a subset of the blueprint configuration.
 15. The method according to claim 13, further comprising: ignoring a discovered connectivity from the re-routing, if the discovered connectivity is in conflict with the blueprint fabric configuration.
 16. The method according to claim 13, further comprising: allowing a fabric configuration that represents a well defined subset of the blueprint fabric configuration to be routed more optimally than in the blueprint fabric configuration.
 17. The method according to claim 13, further comprising: defining the blueprint fabric configuration using up-front topology simulations, and performing a fast discovery and distribution of switch forwarding table contents at run-time.
 18. The method according to claim 13, further comprising: defining more than one alternative for the blueprint fabric configuration, and maintaining the more than one alternative in an ordered list.
 19. A system for supporting controlled re-routing in an InfiniBand (IB) network, comprising: a subnet manager associated with a fabric, wherein the subnet manager operates to perform the steps of: detecting a connectivity change in the fabric, re-routing the fabric and ensuring that only accredited components and connectivity are utilized in the re-routing, and representing the connectivity that is not accredited within a local subnet or sub-subnet.
 20. A non-transitory machine readable storage medium having instructions stored thereon that when executed cause a system to perform the steps of: detecting, via a subnet manager, a connectivity change in a fabric, re-routing the fabric, via the subnet manager, and ensuring that only accredited components and connectivity are utilized in the re-routing, and representing the connectivity that is not accredited within a local subnet or sub-subnet. 